[4mLLAMAFILE-QUANTIZE[24m(1)       General Commands Manual      [4mLLAMAFILE-QUANTIZE[24m(1)

[1mNAME[0m
       llamafile-quantize — large language model quantizer

[1mSYNOPSIS[0m
       [1mllamafile-quantize  [22m[flags...]  [4mmodel-f32.gguf[24m  [[4mmodel-quant.gguf[24m] [4mtype[0m
                          [[4mnthreads[24m]

[1mDESCRIPTION[0m
       [1mllamafile-quantize [22mconverts  large  language  model  weights  from  the
       float32  or float16 formats into smaller data types from 2 to 8 bits in
       size.

[1mOPTIONS[0m
       The following flags are available:

       [1m--allow-requantize[0m
               Allows requantizing tensors that have already  been  quantized.
               Warning:  This can severely reduce quality compared to quantiz‐
               ing from 16bit or 32bit

       [1m--leave-output-tensor[0m
               Will leave output.weight un(re)quantized. Increases model  size
               but may also increase quality, especially when requantizing

       [1m--pure  [22mDisable  k-quant  mixtures and quantize all tensors to the same
               type

[1mARGUMENTS[0m
       The following positional arguments are accepted:

       [4mmodel-f32.gguf[0m
               Is the input file, which contains the unquantized model weights
               in either the float32 or float16 format.

       [4mmodel-quant.gguf[0m
               Is the output file, which will contain quantized weights in the
               desired format. If this path isn't specified, it'll default  to
               [inp path]/ggml-model-[ftype].gguf.

       [4mtype[24m    Is the desired quantization format, which may be the integer id
               of  a supported quantization type, or its name. See the quanti‐
               zation types section below for acceptable formats.

       [4mnthreads[0m
               Number of threads to use during computation (default: nproc/2)

[1mQUANTIZATION TYPES[0m
       The following quantization types are available. This table shows the ID
       of the quantization format, its name, the file size of 7B model weights
       that use it, and finally the amount of quality badness it introduces as
       measured by the llamafile-perplexity tool averaged over 128 chunks with
       the TinyLLaMA 1.1B v1.0 Chat model. Rows are ordered in accordance with
       how recommended the quantization format is for general usage.

       [1m-     [22m18 Q6_K   5.6gb +0.0446 ppl (q6 kawrakow)
       [1m-      [22m7 Q8_0   7.2gb +0.0022 ppl (q8 gerganov)
       [1m-      [22m1 F16    14gb  +0.0000 ppl (best but biggest)
       [1m-      [22m8 Q5_0   4.7gb +0.0817 ppl (q5 gerganov zero)
       [1m-     [22m17 Q5_K_M 4.8gb +0.0836 ppl (q5 kawrakow medium)
       [1m-     [22m16 Q5_K_S 4.7gb +0.1049 ppl (q5 kawrakow small)
       [1m-     [22m15 Q4_K_M 4.1gb +0.3132 ppl (q4 kawrakow medium)
       [1m-     [22m14 Q4_K_S 3.9gb +0.3408 ppl (q4 kawrakow small)
       [1m-     [22m13 Q3_K_L 3.6gb +0.5736 ppl (q3 kawrakow large)
       [1m-     [22m12 Q3_K_M 3.3gb +0.7612 ppl (q3 kawrakow medium)
       [1m-     [22m11 Q3_K_S 3.0gb +1.3834 ppl (q3 kawrakow small)
       [1m-     [22m10 Q2_K   2.6gb +4.2359 ppl (tiniest hallucinates most)
       [1m-     [22m32 BF16   14gb  +0.0000 ppl (canonical but cpu/cuda only)
       [1m-      [22m0 F32    27gb   9.0952 ppl (reference point)
       [1m-      [22m2 Q4_0   3.9gb +0.3339 ppl (legacy)
       [1m-      [22m3 Q4_1   4.3gb +0.4163 ppl (legacy)
       [1m-      [22m9 Q5_1   5.1gb +0.1091 ppl (legacy)
       [1m-     [22m12 Q3_K   alias for Q3_K_M
       [1m-     [22m15 Q4_K   alias for Q4_K_M
       [1m-     [22m17 Q5_K   alias for Q5_K_M
       [1m-   [22mCOPY Only copy tensors, no quantizing.

[1mSEE ALSO[0m
       [4mllamafile[24m(1),      [4mllamafile-imatrix[24m(1),       [4mllamafile-perplexity[24m(1),
       [4mllava-quantize[24m(1), [4mzipalign[24m(1), [4munzip[24m(1)

Llamafile Manual               December 5, 2023          [4mLLAMAFILE-QUANTIZE[24m(1)
